pRactice corner: PCA Wine

lruolin

Summary

PCA is a data reduction technique, to uncover latent variables that are uncorrelated. It is an unsupervised way of classification. Not all of the variables in high-dimensional data are required. Some are highly correlated with others and these variables may be omitted, while retaining a similar level of information in the dataset in terms of explaining the variance.

It is used as an exploratory data analysis tool, and may be used for feature engineering and/or clustering.

Workflow

Import data
Exploratory data analysis

skim
ggcorr
ggpairs

Check assumptions on whether PCA can be carried out

KMO
Bartlett

Carry out PCA using tidymodels workflow

Always use only continuous variables, ensure that there are no missing data. Determine the number of components using eigenvalues, scree plots and parallel analysis.

recipe : preprocess the data (missing values, center and scale, ensuring that variables are continuous)
prep : evaluate the data
bake : get the PCA Scores results
visualize
communicate results: show the scree plot, PCA loadings, variance explained by each component, loadings and score plot.

The scores plot show the positions of the individual wine samples in the coordinate system of the PCs.

The loadings plot shows the contribution of the X variables to the PCs.

Loading packages

library(pacman)
p_load(corrr, palmerpenguins, GGally, tidymodels, tidytext, tidyverse, psych,
       skimr, gridExtra, kohonen, janitor, learntidymodels, kohonen)

Import

This dataset is from the kohonen package. It contains 177 rows and 13 columns.

These data are the results of chemical analyses of wines grown in the same region in Italy (Piedmont) but derived from three different cultivars: Nebbiolo, Barberas and Grignolino grapes. The wine from the Nebbiolo grape is called Barolo. The data contain the quantities of several constituents found in each of the three types of wines, as well as some spectroscopic variables.

The dataset requires some cleaning, and the type of wine was added to the datset.

data(wines)

wines <- as.data.frame(wines) %>% 
  janitor::clean_names() %>%  # require data.frame
  as_tibble() %>% 
  cbind(vintages)  # vintages = Y outcome = category
 
glimpse(wines)

Rows: 177
Columns: 14
$ alcohol          <dbl> 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, 1…
$ malic_acid       <dbl> 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1…
$ ash              <dbl> 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2…
$ ash_alkalinity   <dbl> 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 1…
$ magnesium        <dbl> 100, 101, 113, 118, 112, 96, 121, 97, 98, 1…
$ tot_phenols      <dbl> 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2…
$ flavonoids       <dbl> 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2…
$ non_flav_phenols <dbl> 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0…
$ proanth          <dbl> 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1…
$ col_int          <dbl> 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5…
$ col_hue          <dbl> 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1…
$ od_ratio         <dbl> 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2…
$ proline          <dbl> 1050, 1185, 1480, 735, 1450, 1290, 1295, 10…
$ vintages         <fct> Barolo, Barolo, Barolo, Barolo, Barolo, Bar…

EDA

Some exploratory data analysis was carried out:

What are the types of variables? Categorical or numerical?
What is the distribution like? Skewed?
Are there any missing values?
Are there any outliers?
Check the types of wine
Are the variables quite correlated with each other?

skimr

skim(wines) # 177 x 13, all numeric + Y outcome

Table 1: Data summary
Name	wines
Number of rows	177
Number of columns	14
_______________________
Column type frequency:
factor	1
numeric	13
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
vintages	0	1	FALSE	3	Gri: 71, Bar: 58, Bar: 48

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
alcohol	1	12.99	0.81	11.03	12.36	13.05	13.67	14.83	▂▇▇▇▃
malic_acid	1	2.34	1.12	0.74	1.60	1.87	3.10	5.80	▇▅▂▂▁
ash	1	2.37	0.28	1.36	2.21	2.36	2.56	3.23	▁▂▇▅▁
ash_alkalinity	1	19.52	3.34	10.60	17.20	19.50	21.50	30.00	▁▆▇▃▁
magnesium	1	99.59	14.17	70.00	88.00	98.00	107.00	162.00	▅▇▃▁▁
tot_phenols	1	2.29	0.63	0.98	1.74	2.35	2.80	3.88	▅▇▇▇▁
flavonoids	1	2.02	1.00	0.34	1.20	2.13	2.86	5.08	▆▆▇▂▁
non_flav_phenols	1	0.36	0.12	0.13	0.27	0.34	0.44	0.66	▃▇▅▃▂
proanth	1	1.59	0.57	0.41	1.25	1.55	1.95	3.58	▃▇▆▂▁
col_int	1	5.05	2.32	1.28	3.21	4.68	6.20	13.00	▇▇▃▂▁
col_hue	1	0.96	0.23	0.48	0.78	0.96	1.12	1.71	▅▇▇▃▁
od_ratio	1	2.60	0.71	1.27	1.93	2.78	3.17	4.00	▆▃▆▇▂
proline	1	745.10	314.88	278.00	500.00	672.00	985.00	1680.00	▇▇▅▃▁

GGally

wines %>% 
  select(-vintages) %>% 
  ggcorr(label = T, label_alpha = T, label_round = 2)

wines %>% 
  ggpairs(aes(col = vintages))

Checking assumptions

Is the dataset suitable for PCA analysis?

# Continuous Y
# No missing data
# Check assumptions for PCA #####
wines_no_y <- wines %>% 
  select(-vintages)

glimpse(wines_no_y)

Rows: 177
Columns: 13
$ alcohol          <dbl> 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, 1…
$ malic_acid       <dbl> 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1…
$ ash              <dbl> 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2…
$ ash_alkalinity   <dbl> 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 1…
$ magnesium        <dbl> 100, 101, 113, 118, 112, 96, 121, 97, 98, 1…
$ tot_phenols      <dbl> 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2…
$ flavonoids       <dbl> 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2…
$ non_flav_phenols <dbl> 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0…
$ proanth          <dbl> 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1…
$ col_int          <dbl> 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5…
$ col_hue          <dbl> 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1…
$ od_ratio         <dbl> 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2…
$ proline          <dbl> 1050, 1185, 1480, 735, 1450, 1290, 1295, 10…

# KMO: Indicates the proportion of variance in the variables that may be caused by underlying factors. High values (close to 1) indicate that factor analysis may be useful.
wines_no_y %>% 
  cor() %>% 
  KMO() # .70 above : YES

Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = .)
Overall MSA =  0.78
MSA for each item = 
         alcohol       malic_acid              ash   ash_alkalinity 
            0.73             0.80             0.44             0.68 
       magnesium      tot_phenols       flavonoids non_flav_phenols 
            0.67             0.87             0.81             0.82 
         proanth          col_int          col_hue         od_ratio 
            0.85             0.62             0.79             0.86 
         proline 
            0.81

# Bartlett's test of sphericity: tests the hypothesis that the correlation matrix is an identity matrix (ie variables are unrelated and not suitable for structure detection.) For factor analysis, the p. value should be <0.05.

wines_no_y %>% 
  cor() %>% 
  cortest.bartlett(., n = 177) # p<0.05

$chisq
[1] 1306.787

$p.value
[1] 3.302319e-222

$df
[1] 78

Tidymodels (PCA)

Recipe

With the use of update_role(), the types of wine information is retained in the dataset.

step_normalize() combines step_center() and step_scale()

Note that step_pca is the second step –> will need to retrieve the PCA results from the second list later.

wines_recipe <- recipe(~ ., data = wines) %>% 
  update_role(vintages, new_role = "id") %>%  
  # step_naomit(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors(), id = "pca")

wines_recipe # 13 predictors

Data Recipe

Inputs:

      role #variables
        id          1
 predictor         13

Operations:

Centering and scaling for all_predictors()
No PCA components were extracted.

Preparation

wines_prep <- prep(wines_recipe)

wines_prep # trained

Data Recipe

Inputs:

      role #variables
        id          1
 predictor         13

Training data contained 177 data points and no missing data.

Operations:

Centering and scaling for alcohol, malic_acid, ... [trained]
PCA extraction with alcohol, malic_acid, ... [trained]

tidy_pca_loadings <- wines_prep %>% 
  tidy(id = "pca")

tidy_pca_loadings # values here are the loading

# A tibble: 169 x 4
   terms               value component id   
   <chr>               <dbl> <chr>     <chr>
 1 alcohol          -0.138   PC1       pca  
 2 malic_acid        0.246   PC1       pca  
 3 ash               0.00432 PC1       pca  
 4 ash_alkalinity    0.237   PC1       pca  
 5 magnesium        -0.135   PC1       pca  
 6 tot_phenols      -0.396   PC1       pca  
 7 flavonoids       -0.424   PC1       pca  
 8 non_flav_phenols  0.299   PC1       pca  
 9 proanth          -0.313   PC1       pca  
10 col_int           0.0933  PC1       pca  
# … with 159 more rows

Bake

wines_bake <- bake(wines_prep, wines)
wines_bake  # has the PCA LOADING VECTORS that we are familiar with

# A tibble: 177 x 6
   vintages   PC1    PC2    PC3      PC4     PC5
   <fct>    <dbl>  <dbl>  <dbl>    <dbl>   <dbl>
 1 Barolo   -2.22 -0.301 -2.03   0.281   -0.259 
 2 Barolo   -2.52  1.06   0.974 -0.734   -0.198 
 3 Barolo   -3.74  2.80  -0.180 -0.575   -0.257 
 4 Barolo   -1.02  0.886  2.02   0.432    0.274 
 5 Barolo   -3.04  2.16  -0.637  0.486   -0.630 
 6 Barolo   -2.45  1.20  -0.985  0.00466 -1.03  
 7 Barolo   -2.06  1.64   0.143  1.20     0.0105
 8 Barolo   -2.51  0.958 -1.78  -0.104   -0.871 
 9 Barolo   -2.76  0.822 -0.986 -0.374   -0.437 
10 Barolo   -3.48  1.35  -0.428 -0.0399  -0.316 
# … with 167 more rows

Check number of PC

# a. Eigenvalues: Keep components greater than 1
# data is stored in penguins_prep, step 3

wines_prep$steps[[2]]$res$sdev # 3

 [1] 2.1628220 1.5815708 1.2055413 0.9614802 0.9282978 0.8030241
 [7] 0.7429548 0.5922321 0.5377546 0.4967984 0.4748054 0.4103374
[13] 0.3224124

# b. Scree plot/Variance plot

wines_prep %>% 
  tidy(id = "pca", type = "variance") %>% 
  filter(terms ==  "percent variance") %>% 
  ggplot(aes(x = component, y = value)) +
  geom_point(size = 2) +
  geom_line(size = 1) +
  scale_x_continuous(breaks = 1:4) +
  labs(title = "% Variance explained",
       y = "% total variance",
       x = "PC",
       caption = "Source: ChemometricswithR book") +
  geom_text(aes(label = round(value, 2)), vjust = -0.3, size = 4) +
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        axis.text = element_text(size = 10),
        plot.title = element_text(size = 14, face = "bold"))  # 2 or 3

# bii: Cumulative variance plot

wines_prep %>% 
  tidy(id = "pca", type = "variance") %>% 
  filter(terms == "cumulative percent variance") %>%
  ggplot(aes(component, value)) +
  geom_col(fill= "forestgreen") +
  labs(x = "Principal Components", 
       y = "Cumulative variance explained (%)",
       title = "Cumulative Variance explained") +
  geom_text(aes(label = round(value, 2)), vjust = -0.2, size = 4) +
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        axis.text = element_text(size = 10),
        plot.title = element_text(size = 14, face = "bold"))

# c. Parallel analysis

fa.parallel(cor(wines_no_y),
            n.obs = 333,
            cor = "cor",
            plot = T)  # 3

Parallel analysis suggests that the number of factors =  4  and the number of components =  3

Visualize

Loadings plot

plot_loadings <- tidy_pca_loadings %>% 
  filter(component %in% c("PC1", "PC2", "PC3", "PC4")) %>% 
  mutate(terms = tidytext::reorder_within(terms, 
                                          abs(value), 
                                          component)) %>% 
  ggplot(aes(abs(value), terms, fill = value>0)) +
  geom_col() +
  facet_wrap( ~component, scales = "free_y") +
  scale_y_reordered() + # appends ___ and then the facet at the end of each string
  scale_fill_manual(values = c("deepskyblue4", "darkorange")) +
  labs( x = "absolute value of contribution",
        y = NULL,
        fill = "Positive?",
        title = "PCA Loadings Plot",
        subtitle = "Number of PC should be 3, compare the pos and the neg",
        caption = "Source: ChemometricswithR") +
  theme_minimal()


plot_loadings

# PC1: flavonoids, tot_phenols, od_ratio, proanthocyanidins, col_hue, 36%
# PC2: col_int, alcohol, proline, ash, magnesium; 19.2%
# PC3: ash, ash_alkalinity, non_flav phenols; 11.2%
# PC4: malic acid?

An alternative way to plot:

# alternate plot loadings

learntidymodels::plot_top_loadings(wines_prep,
                  component_number <= 4, n = 5) +
  scale_fill_manual(values = c("deepskyblue4", "darkorange")) +
  theme_minimal()

Loadings only

# define arrow style
arrow_style <- arrow(angle = 30,
                     length = unit(0.2, "inches"),
                     type = "closed")

# get pca loadings into wider format
pca_loadings_wider <- tidy_pca_loadings%>% 
  pivot_wider(names_from = component, id_cols = terms)


pca_loadings_only <- pca_loadings_wider %>% 
  ggplot(aes(x = PC1, y = PC2)) +
  geom_segment(aes(xend = PC1, yend = PC2),
               x = 0, 
               y = 0,
               arrow = arrow_style) +
  ggrepel::geom_text_repel(aes(x = PC1, y = PC2, label = terms),
            hjust = 0, 
            vjust = 1,
            size = 5,
            color = "red") +
  labs(title = "Loadings on PCs 1 and 2 for normalized data") +
  theme_classic()

Scores plot

# Scores plot #####
# PCA SCORES are in bake
pc1pc2_scores_plot <- wines_bake %>% 
  ggplot(aes(PC1, PC2)) +
  geom_point(aes(color = vintages, shape = vintages), 
             alpha = 0.8, size = 2) +
  scale_color_manual(values = c("deepskyblue4", "darkorange", "purple")) +
  labs(title = "Scores on PCs 1 and 2 for normalized data",
       x = "PC1 (36%)",
       y = "PC2 (19.2%)") +
  theme_classic() +
  theme(legend.position = "none")

Finalised plots

grid.arrange(pc1pc2_scores_plot, pca_loadings_only, ncol = 2)

Check against Data

wines %>% 
  group_by(vintages) %>% 
  summarise(across(c(flavonoids, col_int, ash, malic_acid),
                   mean,
                   na.rm = T))

# A tibble: 3 x 5
  vintages   flavonoids col_int   ash malic_acid
  <fct>           <dbl>   <dbl> <dbl>      <dbl>
1 Barbera         0.781    7.40  2.44       3.33
2 Barolo          2.98     5.53  2.46       2.02
3 Grignolino      2.08     3.09  2.24       1.93

Interpretation of results

PCA allows for exploratory characterizing of x variables that are associated with each other.

PC1: flavanoids, total phenols, OD_ratio. PC2: color intensity, alcohol, proline PC3: ash, ash_alkalinity PC4: malic acid (by right 3 components are sufficient)

Barbera, indicated in blue, has the largest score on PC 1 and PC2. Barolo, indicated in orange, has the smallest score on PC 1. Grignolo, indicated in purple, has the lowest score on PC 2.

Barbera has low flavonoids, high col_int and high malic acid Barolo has high flavonoids, medium col_int and intermediate malic acid Grignolino has intermediate flavonoids, high col_int and low malic acid.

References

Comment on this article Share:

PCA Wine